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ABSTRACT 


Blocking  (Tiling)  techniques  of  iteration  spaces  to 
increase  data  reuse  in  the  cache  were  reviewed.  Results 
consistent  with  those  previously  published  were  experimentally 
obtained.  The  relation  between  the  sizes  of  the  declared 
matrix  and  the  cache  was  studied.  Based  on  this  relation,  two 
algorithms  were  presented.  Both  algorithms  attempt  to 
increase  the  critical  blocking  factor  with  no  self¬ 
interference  (Be)  by  changing  the  declared  matrix  cize. 
Furthermore,  the  execution  time  of  the  second  algorithm  is 
independent  of  the  matrix  size.  Experiments  based  on  these 
algorithms  were  performed  which  showed  a  consistent  superior 
performance  (in  terms  of  Mflops)  relative  to  the  performance 
obtained  using  previously  published  algorithms  for  deriving 
Be. 
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Z.  INTRODUCTION 


A.  PERFORMANCE  IMPROVEMENT  IN  A  COMPUTER 
1.  The  Memory  Bottleneck 

An  essential  component  of  every  computer  is  its 
memory.  Without  memory  there  could  be  no  computers  as  we  now 
know  them.  There  is  an  important  axiom  of  hardware  design: 
smaller  Is  faster  [Ref.  18] .  Smaller  pieces  of  hardware  will 
generally  be  faster  than  larger  pieces.  In  high-speed 
machines,  signal  propagation  is  a  major  cause  of  delay,  and 
larger  memories  have  more  signal  delay  and  require  more  levels 
to  decode  addresses.  A  basic  attribute  used  to  measure  the 
effectiveness  of  a  memory  configuration  is  the  memory  band¬ 
width.  This  is  the  number  of  words  that  can  be  accessed  per 
second  or  the  rate  of  information  transfer.  Incr-^sing  the 
computational  power  without  a  corresponding  increase  in  the 
memory  bandwidth  of  data  to  and  from  memory  can  create  a 
serious  bottleneck  [Ref.  18].  In  other  words,  increasing 
memory  bandwidth  and  decreasing  the  latency  (execut .  on  time)  of 
memory  access  are  both  crucial  to  system  performance.  Since 
these  two  measures  are  so  much  important,  we  should  improve 
them  to  obtain  a  better  performance. 
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2 .  Cache  Memory 

The  analysis  of  a  large  number  of  computer  programs 
has  shown  that  during  program  execution,  memory  references 
tend  to  occur  in  very  localized  patterns.  Consider,  for 
example,  a  block  of  contiguous  memory  locations.  If  this 
block  consists  of  straight  line  code,  then  execution  will 
proceed  sequentially  through  the  block.  If  the  block  repre¬ 
sents  a  data  array,  then  it  is  likely  that  the  block  will 
often  be  used,  not  necessarily  in  a  sequential  order,  but  at 
least  as  a  whole.  This  characteristic  of  programs  is  referred 
to  as  the  locality  of  reference  principle.  When  a  program  is 
stored  in  main  memory,  the  access  time  for  fetching  a  memory 
word  is  a  function  of  the  capacity  (size)  of  the  memory  and  not 
a  function  of  the  size  of  a  local  block  in  the  memory  [Ref. 
19]  .  Hence,  to  get  the  benefit  from  the  locality  of  reference 
principle,  high-speed  buffer (s)  can  be  inserted  between  the 
processor (s)  and  main  memory  to  capture  those  active  portions 
(blocks)  of  the  contents  of  main  memory  currently  in  use. 
Then,  rather  than  fetching  the  word  at  the  next  location  from 
the  main  memory,  it  could  be  fetched  much  more  quickly  from 
this  special  high-speed  buffer  called  cache  memory  and 
implemented  in  most  computers  as  shown  in  Figure  1.  The  major 
reasons  for  having  a  memory  hierarchy  are  to  get  a  better 
performance  and  to  reduce  execution  time,  not  misses  [Ref. 
18] .  If  cache  misses  can  be  reduced  by  some  means,  then  the 
overall  performance  of  processors  will  improve. 
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Figure  1 

Memory  Hierarchy 

To  fully  realize  the  potential  of  the  processors,  the 
faster  levels  of  the  memory  hierarchy  such  as  cache  memory 
must  be  efficiently  utilized.  But,  a  distinct  characteristic 
of  numerical  applications  is  that  they  tend  to  operate  on 
large  data  sets  where  as  a  cache  may  only  be  able  to  hold 
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small  fraction  of  a  matrix.  Thus  even  if  the  data  are  reused, 
they  may  have  been  displaced  from  the  cache  by  the  time  they 
are  reused,  causing  a  high  miss  ratio  in  the  cache.  For  this 
reason,  blocking  techniques  are  utilized  for  the  goal  of 
reducing  memory  traffic  and  exploiting  this  data  reuse. 
Blocking  techniques  restructure  the  code  to  move  references  to 
the  same  memory  location  closer  together,  hence  improving 
cache  performance.  Blocking  can  be  applied  at  different 
levels  of  the  memory  hierarchy,  including  physical  memory, 
caches,  and  registers  [Ref.  9].  In  the  experiments,  blocking 
has  been  applied  both  at  data  cache  and  register  levels, 

B.  IMPROVING  DATA  LOCALITY 

1.  Types  of  Misses  in  Cache 

Cache  misses  occur  in  one  of  three  forms: 

*  Compulsory  misses:  The  first  access  to  a  block  is  not  in 
the  cache,  so  the  block  must  be  brought  into  the  cache. 
These  are  called  cold  start  misses  or  reference  misses. 

*  Capacity  misses:  If  the  cache  cannot  contain  all  the 

blocks  needed  during  a  program  execution  of  a  program, 
capacity  misses  will  occur  due  to  blocks  being  discarded 
and  later  retrieved. 

*  Conflict  misses:  If  the  block-placement  strategy  is  set 
associative  or  direct  mapped,  conflict  misses  (in  addition 
to  compulsory  and  capacity  misses)  will  occur  because  a 
block  can  be  discarded  and  later  retrieved  if  too  many 
blocks  map  to  its  set.  These  are  also  called  collision 
misses  [Ref.  19], 

Conceptually,  conflict  misses  could  be  the  easiest  to 
avoid:  Fully  associative  placement  avoids  all  conflict  misses 
by  mapping  an  address  in  the  main  memory  to  any  cache  block. 
But,  this  associativity  is  expensive  in  hardware  and  may  also 
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slow  down  the  access  time,  leading  to  a  lower  overall  perfor¬ 
mance  [Ref.  19].  Therefore,  fully  associative  caches  are  not 
built.  Capacity  misses  can  be  reduced  by  using  a  large  cache. 
Larger  cache  lines  reduce  the  number  of  compulsory  misses,  but 
this  may  lead  to  an  increase  in  conflict  misses. 

2.  Transformation  (Restructuring) 

If  a  program  can  be  transformed  into  a  version  that 
makes  better  use  of  the  cache,  then  the  number  of  requests  to 
memory  can  be  reduced,  thus  improving  tne  execu'rion  time. 
Program  transformations  that  move  consecutive  uses  of  a  memory 
location  closer  together  can  increase  the  number  of  times  a 
value  is  used  before  it  is  replaced.  A  cache  miss  removed  by 
code  transformation  will  require  less  traffic  since  the  niomber 
of  times  a  value  is  loaded  into  the  cache  is  decreased  [Ref. 
7]  . 

The  basic  theory  of  this  transformation  (restructur¬ 
ing)  process  is  based  on  data  dependence  analysis .  Data 
dependence  information  is  needed  to  test  whether  restructuring 
transformations  are  legal  (the  program  produces  the  same 
answer  after  restructuring  as  it  did  before)  [Ref.  13].  Data 
dependence  analysis  will  be  explained  in  detail  in  Chapter  II. 
Briefly,  it  is  a  tool  which  is  applied  to  partition  a  serial 
program  into  blocks  of  rode  that  contain  well-defined  depen¬ 
dences.  The  purpose  of  dependence  axii..lysis  is  to  prove  the 
presence  or  absence  of  data  dependence  between  the  oiocks 
[Ref.  16].  These  blocks  may  be  converted  into  independent 
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tasks,  scheduled  onto  one  or  more  parallel  processors,  and  run 
as  a  parallel  program.  The  problem  gets  complex  and  difficult 
when  the  blocks  are  control  structures  such  as  loops, 
branches,  and  procedures. 

Loops  are  difficult  to  analyze,  and  yet  they  are  the 
best  candidate  for  optimizations  obtained  by  transformation 
techniques  such  as  vector ization,  fusion,  coalescing,  distri¬ 
bution,  interchange,  node  splitting,  shrinking,  unfolding, 
skewing,  and  blocking  transformation  tecbniguea  [Ref.  16]. 

3 .  Performance  Experiments  with  Blocked  Codes 

In  this  thesis,  we  experiment  with  techniques  to 
improve  data  cache  performance  with  blocked  codes.  We  apply 
these  techniques  on  two  machines:  DECstation  3100  and 
Solbourne  S4000  SPARC  station.  Together  with  blocking,  we 
investigate  the  effect  of  further  optimizations  such  as 
unfolding  (unrolling)  on  the  data  cache  performance. 

Blocking  techniques  are  used  to  increase  the  life  time 
of  the  piece  of  data  in  the  cache.  Therefore,  this  allows 
reused  data  to  still  be  in  the  cache  and  reduces  the  memory 
accesses.  Blocking  techniques  improve  cache  performance  by 
restructuring  the  code  to  move  references  to  the  same  memory 
location  closer  together,  hence  eliminating  cache  misses.  To 
illustrate  the  benefits  of  blocking,  we  handle  matrix  multi¬ 
plication  code,  showing  how  it  is  blocked  and  how  the  reuse  of 
data  in  that  piece  of  code  can  be  exploited.  A  basic  matrix 


multiplication  computes  an  inner  product  of  a  row  and  a  column 
of  matrices  B  and  C  for  each  element  of  the  result  matrix  A. 

for  (I  =  1;  I<N;  -h+I)  { 

for  (J  =  1;  J<N;  ++J)  { 

for  (K  =  1;  K<N;  ++K)  { 

C(I,J)  +=  A(I,K)*B(K,J)  ; 

} 

} 

} 

Figure  2 

Main  loops  of  Matrix  Multiplication 

We  assume  that  we  have  a  direct-mapped  data  cache  and  the 
arrays  are  stored  rowwise  in  cache  lines.  If  we  substitute 
some  loop  bounds  for  the  code  above,  then  we  can  show  the 
potential  in  it  for  the  reuse  easily.  In  Example  1  below, 
matrix  multiplication  code  is  utilized  and  N  has  the  value  2. 


M. 

+=  A(l,  1) 

*  B(l,  1) 

[I  =  1, 

j  =  1 

-  2, 

II 

2) 

+=  A(l,  1) 

*  B(l,  2) 

JJ. 

+=  A(l,  2) 

*  B(2,  1) 

[I  =  1, 

j  =  1 

-  2, 

K  =  2] 

M. 

+=  A(l,  2) 

*  B(2,  2) 

C(2, 

JJ. 

+=  A(2,  1) 

*  B(i,  i; 

[I  =  2, 

J  =  1 

-  2, 

K  =  1] 

C{2, 

-21 

+  =  A(2,  1) 

*  Bfl.  2) 

C(2, 

J1 

+=  A (2,  2) 

*  B(2,  1) 

H 

II 

II 

i-> 

-  2, 

K  =  2] 

CJ2^ 

J1 

+=  A (2,  2) 

*  B(2,  2) 

As  seen  above,  there  are  array  variables  which  are  repeatedly 
referenced  and  retrieved  from  the  memory.  For  example,  both 
A(l,  1)  is  referenced  when  K  =  1  for  different  iterations  of 
J  loop.  A(2,  2)  present  the  same  behavior  when  K  =  2.  All 
these  repeating  references  cause  unwanted  cache  traffic  (if 
the  cache  is  not  large  enough  to  hold  the  whole  A  array)  . 
Here  we  assume  that  we  have  a  data  cache  size  of  2^  =  4  words 
and  the  reused  array  elements  in  the  first  four  lines  of 
Example  1  are  placed  in  this  cache  as  shown  in  Figure  3 . 
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Figure  3 

Data  Cache  and  Main  Memory 

To  provide  this  type  of  placement  in  the  Cc  we  can  use 

subblocks  of  the  arrays  as  follows: 


Ci.i 

< 

II 

★ 

Bi.i 

+ 

A1.2 

* 

B2,1 

CU2 

= 

★ 

B1.2 

+ 

A1.2 

* 

B2,: 

C2.1 

= 

★ 

Bi,i 

+ 

A2.2 

* 

B2.i 

C2,2 

=  A^'i 

★ 

B1.2 

+ 

A2.2 

* 

B2,: 

The  following  figure  shows  these  subblocks  explicitly. 
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Figure  4 


Subblocks  in  a  matrix  multiplication  code  when 
matrix  size  is  2 

This  subblocking  exhibits  the  advantage  that  the 
blocks  can  be  sized  to  fit  into  the  fastest  level  of  the 
memory  hierarchy  such  as  cache  memory,  and  by  this  way  the 
data  in  each  submatrix  is  used  many  times  during  each  matrix 
multiplication.  These  benefits  of  subblocking  can  be  enhanced 
via  program  transformations  which  form  the  basis  of  blocking 
techniques .  The  main  loops  of  matrix  multiplication  code  in 
Figure  2  are  blocked  and  shown  in  Figure  5 . 
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for  (KK=0;  KK<N;  KK=KK+B)  { 


for  (JJ=0;  JJ<N;  JJ=JJ+B)  { 
for  (1=0;  I<N;  ++I)  { 

for  (K=KK;  K<min  (N,  KK+B)  ;  ++K)  { 

C[I][J]  =  0; 
temp  =  A [I]  [K]  ; 

M  =  min(N,JJ+B); 
for  (J=JJ;  J<M;  ++J)  { 

C[I][J]  =  C[I][J]  +  temp*B[K]  [J] ; 

} 

} 


Figure  5 

Main  Loops  o£  Blocked  Matrix  Multiplication 


4 .  Previous  Work  on  Blocking 

It  has  long  been  known  that  program  restructuring  can 
dramatically  reduce  the  load  on  a  memory'  hierarchy  subsystem 
([1]/  [2],  [3]).  Previous  research  on  blocking  focused  on  how 
to  block  an  algorithm  manually  and  automatically  ( [5]  ,  [6]  , 

[7],  [8]).  Irigoin  and  Triolet  [Ref.  4]  describe  a  procedure 
to  partition  the  iteration  space  of  a  tightly-loop  into 
supernodes,  where  each  supernode  covers  a  scL  of  iterations 
that  will  be  scheduled  as  an  atomic  task  on  a  processor.  That 
procedure  works  from  a  new  data  dependence  abstraction,  called 
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the  dependence  cone.  The  dependence  cone  is  used  to  find 
legal  partitions  and  to  find  dependence  constraints  between 
supernodes . 

Lam,  Rothberg,  and  Wolf  [Ref.  9]  show  that  the  degree 
of  cache  interference  is  highly  sensitive  to  the  stride  of 
data  accesses  and  the  size  of  the  blocks,  and  can  cause  wide 
variations  in  machine  performance  for  different  matrix  sizes. 
They  presented  cache  performance  data  (obtained  via 
simulation)  for  blocked  programs  and  evaluate  several 
optimizations  such  as  using  a  fixed  blocking  factor  and 
copying  non-contiguous  data  to  be  reused  into  a  contig-uous 
area.  They  found  that  trying  to  use  the  entire  cache,  or  even 
a  fixed  fraction  of  the  cache  is  not  so  useful  and  propose  an 
algorithm  to  tailor  the  block  size  according  to  matrix  size  to 
improve  the  average  miss  rate. 

Hong  and  Kung  [Ref.  10]  claim  that  the  optimal 
blocking  factor  is  roughly  SQRT(C)  for  matrix  multiplication 
on  a  machine  with  a  local  memory  of  C  words . 

In  an  algorithm  implemented  in  Stanford  University 
Intermediate  Format  Compiler  [Ref.  8],  the  locality  of  a  loop 
nest  by  transforming  the  code  via  interchange,  reversal, 
skewing,  and  blocking  is  improved. 

C.  OROANIZATZQN  OT  THB  THBSXS 

In  Chapter  ll,  we  focus  on  existing  blocking  techniques, 
by  explaining  their  advantages  and  related  terminology.  We 
also  present  data  obtained  from  the  experiments  with  blocked 
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codes  to  observe  the  effectiveness  of  these  blocking  tech¬ 
niques.  In  Chapter  III,  we  propose  an  enhanced  blocking 
technique  which  improves  the  performance  of  workstations  in 
some  scientific  applications.  Conclusions  and  recommenda¬ 
tions  for  further  research  are  offered  in  Chapter  IV. 
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II.  EXPERIMENTING  WITH  BLOCKING 


A.  BLOCKING  TECHNIQUES 

1.  Program  Restructuring 

Blocking  techniques  improve  cache  performance  by 
restructuring  the  code,  hence  eliminating  cache  misses. 
Advanced  compilers  or  supercompilers  are  capable  of  many 
program  restructuring  transformations  such  as  vectorization, 
strip  mining  and  loop  interchanging  [Ref.  13].  These 
transformations  transform  a  program  into  a  version  that  makes 
better  use  of  the  cache,  thus  the  number  of  requests  to  memory 
is  reduced  and  the  execution  time  is  improved.  But  super¬ 
compilers  cannot  transform  every  program  efficiently;  the 
quality  of  the  results  will  depend  on  the  structure  of  the 
algorithm  and  the  architecture  of  the  target  machine.  Data 
dependences  [Ref.  21]  imply  precedence  constraints  among 
computations  which  have  to  be  satisfied  for  a  correct 
execution.  So,  when  determining  the  loops  for  restructuring, 
the  existence  of  these  dependences  must  be  considered.  In  the 
next  sections,  concepts  such  as  data  dependence,  data 
dependence  graph  and  iteration  space  will  be  described, 
a.  Data  Dependence 

Two  types  of  dependence  occur  in  computer 
programs.  Control  Dependence  is  a  consequence  of  the  flow  of 
control  in  a  program.  Execution  of  a  statement  in  orie  path 
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under  an  if  test  is  dependent  cn  the  if  test  taking  the  path. 
Thus,  the  statement  under  control  of  the  if  is  control 
dependent  upon  the  if  test.  Data  Dependence  is  a  consequence 
of  the  flow  of  data  in  a  program.  The  value  of  an  expression 
is  dependent  upon  the  values  of  the  variables  used  in  the 
expression.  Therefore,  a  statement  which  uses  a  variable  in 
an  expression  is  data  dependent  upon  the  statement  which 
computes  the  value  of  the  variable. 

In  programming  languages,  such  as  C,  Fortran,  and 
Pascal,  three  kinds  of  data  dependence  may  occur:  flow- 
dependence  (  or  true  dependence) ,  anti -dependence,  and  output- 
dependence.  The  first  dependence  relation  occurs  when  a  value 
computed  (stored)  in  a  statement  is  used  ( fetched)  in  some 
statement  S„;  we  say  that  S^,  is  data  flow-dependent  on  Sv  and 
write  this  as  Sv  &  S„.  This  type  of  data  dependence  relation 
shows  how  the  data  flows  between  the  statements  of  the 
program.  The  second  kind  of  data  dependence  occurs  when  an 
item  is  used  in  statement  Sv  before  that  item  is  reassigned  in 
some  statement  S„;  we  say  that  S„  is  data  anti -dependent  on  Sv 
before  that  item  is  reassigned  in  some  statement  a’^d  write 
this  as  Sv  &■  S„.  The  last  kind  of  data  dependence  occurs 
when  an  item  is  assigned  in  statement  Sv  before  that  item  is 
reassigned  in  some  statement  S„;  we  say  that  S,,  is  data  output- 
dependent  on  S  and  write  this  as  Sv  &°  S*,  [Ref.  21].  We 
illustrate  these  dependencies  in  Example  2. 
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Bxaiqple  2. 


51  :  A  =  B  +  D 

52  :  C  =  A  *  3 

53  :  A  =  A  +  C 

54  :  E  =  A  /  2 


The  dependence  relations  for  this  code  are: 

S2,S3,S4  are  data  flow-dependent  on  S1,S2,S3  respectively; 

S3  is  also  data  flow -dependent  on  SI; 

S3  is  data  anti -dependent  on  S2; 

S3  is  data  output -dependent  on  SI . 

b .  Dependence  Orapb 

A  dependence  relation  is  a  precedence  relation 
which  requires  execution  of  one  statement  before  another.  A 
parallelizing  (or  optimizing)  compiler  can  build  a  dependence 
graph  by  using  these  dependences.  In  a  dependence  graph, 
nodes  represent  statements  in  the  program  and  directed  arcs 
represent  dependence  relations  [Ref.  21].  The  real  advantage 
of  dependence  graphs  is  their  ignoring  the  arbitrary  ordering 
of  statements  and  concentrating  on  the  dependence  precedence. 
The  dependence  graph  for  Example  2  is  depicted  in  Figure  6: 
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Figure  6 

Dependence  Graph  for  the  code  Example  2 


c.  Xteration  Space 

A  loop,  when  there  is  considerable  potenti;:  -^r 
code  optimization,  can  be  said  to  describe  an  iteration  space. 
A  single  for  loop  describes  a  one-dimensional  iteration  space, 
one  axis  of  a  Cartesian  coordinate  system.  Each  iteration  of 
the  for  loop  corresponds  to  a  point  along  this  axis.  The  for 
loop  will  visit  the  points  along  this  axis  in  a  specific 
order,  as  defined  by  the  for  statement.  If  we  have  two  nested 
for  loops,  then  they  describe  a  two-dimensional  iteiauicr 


space . 


for  (I,  =  1;  I, <5;  ++I.)  { 


for  (I2  =  1;  l2<^!  ++I2)  ( 

5,  :  A(Ij,  I2)  =  B(I,,  I2)  +  C(Ij,  I2) 

S2  :  B(Ij,  l2,i)  =  A(Ij,  I2)  +  B(Ij,  I2) 

} 

} 

Figure  7 

Two-nested  serial  loops 

The  loops  above  define  a  two-dimensional  5X4  iteration  space 
illustrated  in  Figure  8. 


II 
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1 

2 

12  3 

4 
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X  X  X  X 

X  X  X  X 

X  X  X  X 

X  X  X  X 

X  X  X  X 

Figure  8 


Iteration  space  for  the  code  in  Figure  7 


The  shape  of  the  iteration  space  does  not  need  to 
be  rectangular.  The  iteration  space  for  the  loops  below  is 
triangular,  so  it  is  called  as  a  triangular  loop.  Other 
iteration  space  shapes  can  be  defined  by  nested  loop. 
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for  (i  =  1;  i<5;  ++i)  { 

for  (j  =  i;  j<10;  ++j)  { 

A(i,j)  =  B(i,j)  +  C  a,  j)  *  D{i,j) 

} 

} 

Figure  9 
Triangular  Loop 

In  this  research,  a  loop  transformation  technique 
blocking  (iteration  space  tiling)  is  implemented  cn  two 
different  machines.  The  code  used  in  the  experiments  is 
optimized  by  utilizing  blocking  optimizations  to  get  better 
speedups .  We  also  utilized  another  optimization  technique 
called  loop  unfolding  on  the  blocked  code  but  it  did  not  help 
the  performance  of  the  data  cache.  Blocking  utilizes  two 
other  transformation  techniques,  loop  interchanging  and  strip 
mining. 

d .  Loop  Interchanging 

One  of  the  most  important  restructuring  transfor¬ 
mations  is  loop  interchanging.  The  simplest  example  of  loop 
interchanging  is  interchanging  two-nested  loop,  such  as  the 
loop  below:  ^ 
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for  (I  =  1;  I<N;  ++I)  { 
for  (J  =  1;  J<N;  ++J)  { 

A(I,J+1)  =  A(I,J)  *  B(I,K)  +  C(K,J) 

} 

} 

Figure  10 
Two-Nested  Loops 

The  nested  loop  above  is  a  first  order  linear  recurrence;  most 
vector  computers  have  no  corresponding  instruction,  so  the 
loop  would  be  executed  serially,  or  with  some  fast  recurrence 
algorithm.  By  interchanging  the  loops: 

for  (J  =  1;  J<N;  ++J)  { 
for  (I  =  1;  I<N;  ++I)  { 

A(I,J+1)  =  A(I,J)  *  BII,K)  +  C(K,J) 

} 

} 

Figure  11 

Two-Nested  Loops  After  the  Interchange 

The  inner  loop  is  vectorizable.  Not  all  loop  interchanges  are 
legal.  For  instance,  the  loops  below  can  not  be  interchanged: 
for  (I  =  2;  I<N;  ++I)  { 

for  fj  =  1;  J<N-1;  ++J}  { 

A(I,J)  =  A(I-1,J+1)  +  B(I,K) 

} 

} 
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The  original  loop  uses  newly  computed  values  of  the  array  A  on 
the  right  hand  side;  if  the  loops  are  interchanged,  the 
transfoirmed  loop  will  use  only  old  values  of  A  on  the  right 
hand  side.  The  transformed  loop  will  compute  different 
results  and  the  transformation  is  therefore  illegal. 
e .  Strip  Mining 

Vectorizing  compilers  often  divide  a  single  loop 
into  a  pair  of  loops,  where  the  maximum  trip  count  of  the 
inner  loop  is  equal  to  the  maximum  vector  length  the 

machine.  The  loop  in  Example  3. a.  can  be  converted  into  a 
pair  of  the  loops  in  Example  3.b  by  a  vectorizing  compiler. 
This  process  is  called  strip  mining.  The  original  loop  is 
divided  into  strips  of  some  maximum  size,  the  strip  size,  in 
Example  3.b.,  the  inner  loop  (or  element  loop)  a  strip 

size  of  B,  which  is  the  length  of  the  vector  register  in  the 
compiler.  The  outer  loop  {IS  loop,  for  strip  loop)  steps 
between  the  strips; 

Exan^le  3. a. 

for  (I  =  1;  I<N;  ++I)  { 

Sj  :  A  (I)  =  A  (I)  +  B(I) 

S2  :  C(I)  =  A(I-l)  *  2 

} 
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Example  3.b. 

for  (IS  =  1;  I<N;  IS  =  IS  +  B)  { 

for  (I  =  IS;  I<min(N,  IS  +  B) ;  ++I)  { 

Sj  :  A  (I)  =  A(I)  +  B(I) 

S2  :  C(I)  =  A{I-1)  *  2 

} 

} 

2.  Blocking  (Iteration  Space  Tiling) 

a.  Blocking  Algorithms 

Blocking  algorithms  have  been  developed  for  the 
goal  of  reducing  memory  traffic.  Its  advantage  is  that  the 
blocks  can  be  sized  to  fit  into  the  fastest  level  of  the 
memory  hierarchy,  and  that  during  each  computation,  the  data 
in  each  block  is  used  many  times.  Blocking  tries  to  duplicate 
the  benefits  of  block  algorithms  via  program  transformations. 

b.  Blocking  (Tiling) 

We  define  Blocking  as  dividing  the  iteration  space 
into  blocks  (tiles)  of  some  size  and  shape  (typically  squares 
or  cubes),  and  traversing  between  the  blocks  to  cover  the 
whole  iteration  space.  Optimal  blocking  for  a  memory 
hierarchy  will  find  blocks  such  that  all  the  data  for  each 
block  will  fit  into  the  highest  level  of  the  memory  hierarchy 
[Ref.  13].  This  will  reduce  the  number  of  intervening 
iterations  and  data  fetched  between  data  reuses.  The  reused 
data  will  still  be  in  the  cache,  and  hence  memory  accesses 
will  be  reduced.  In  other  words,  the  traversal  between  the 
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blocks  will  follow  an  order  that  will  reduce  the  amount  of 
data  that  needs  to  be  moved  when  going  to  the  next  tile.  To 
be  most  effective  at  blocking  (tiling) ,  block  size  must  be 
tuned  to  fit  into  cache  memory  to  allow  the  minimum  number  of 
misses  to  occur  and  generate  the  least  traffic  between  the 
memory  levels. 

Blocking,  as  a  combination  of  strip  mining  and 
loop  interchanging,  can  improve  codes.  The  goal  is  to  strip 
the  innermost  loop  into  pieces  such  that  the  new  innermost 
loop  fits  entirely  into  cache  memory.  The  newly  created 
middle  loop  is  then  moved  to  the  outer  loop.  Figure  12  below 
illustrates  how  blocking  is  applied  to  loops  whenever  it  is 
legal . 


for  (I  =  1;  I<N;  ++I)  { 

for  (J  =  1;  J<N;  ++J)  { 
loop  body 

} 

} 

becomes 

for  (JJ  =  1;  JJ<N;  JJ  =  JJ  +  B)  { 
for  (I  =  1;  I<N;  ++I)  { 

for  (J  =  JJ;  JJ<B;  ++J)  { 
loop  body  }  }  } 

Figure  12 

Blocking.  B  Is  the  strip  size 
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A  further  optimized  version  of  the  code  in  Figure  12  is 
illustrated  in  Figure  13.  The  inner  two  loops  iterate  in 
square  shaped  blocks  (BxB)  while  the  outer  two  loops  step 
between  the  blocks . 


for  (II  =  1;  II<N;  II  =  II  +  B)  { 

for  (JJ  =  1;  JJ<N;  JJ  =  JJ  +  B)  { 

for  (I  =  II;  I<min(N,  II  +  B) ;  ++I)  { 

for  (J  =  JJ;  J<min(N,  JJ  +  B) ;  ++J)  { 

loop  body 

} 

} 

} 

} 


Figure  13 

Further  Optimized  version  of  the  code  in  Figure  12 


The  iteration  space 
for  the  above  code 
is  shown  in  Figure 
15. 


Figure  14 

Iteration  Space  for  the  code  in  Figure  13 
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B .  PBRFOBMAMCB  EXPERIMENTS  WITH  BLOCKED  MATRIX  MULTIPLICATION 

CODE 

The  experiments  are  performed  on  two  different  machines 
located  in  The  Naval  PostGraduate  School  Matrix  multiplica¬ 
tion  is  utilized  as  the  main  source  code  written  in  C 
language.  Matrix  multiplication  is  a  building  block  in  many- 
linear  algebraic  algorithms.  It  is  also  an  interesting  case 
study  because  locality  is  carried  in  three  different  loops  by 
three  different  variables.  The  entire  computation  in  a  matrix 
multiplication  involves  2N^  arithmetic  operation?  (counting 
additions  and  multiplications  separately) ,  but  produces  and 
consumes  only  3N^  data  values.  As  a  whole,  the  computation 
exhibits  "admirable  reuse  of  data"  [Ref.  23].  In  general, 
however,  an  entire  matrix  will  not  fit  in  a  small  data  cache 
memory.  So,  the  code  is  restructured  such  that  the  necessary 
reuse  of  data  is  achieved  in  cache. 

1 .  Experimental  Setup 

In  our  experiments,  three  different  matrix  sizes  (293, 
295,  300)^  are  utilized  and  the  performance  of  blocked  matrix 
multiplication  is  observed  by  experimenting  with  square  shaped 
blocks  having  sizes  of  multiples  of  8.  To  ensure  the  accuracy 
of  the  timing  results  obtained,  each  experiment  has  been  run 
5  times  by  using  the  library  function  system ()  which  provides 
access  to  operating  system  commands. 


^These  values  were  chosen  so  that  we  compare  our  results  to 
those  in  [Ref.  9]. 
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The  "MFLOPS"  rating  of  the  machine  was  utilized  as  the 
performance  measure  for  the  blocked  portion  of  the  code,  i.e., 
the  loops.  We  had  to  find  the  time  of  that  portion.  The 
command  "time"  provided  in  C  compiler  is  not  appropriate  to 
measure  for  that  type  of  computation.  Because,  it  is  an 
executable  program  available  when  using  the  shell  and  it 
measures  the  time  of  an  executable  program  from  the  beginning 
to  the  end.  So,  in  our  experiments,  another  structure  called 
getrusage  is  used,  "getrusage"  returns  the  user  time  utilized 
by  the  current  process,  or  all  its  terminated  child  processes. 
By  calling  it  twice,  at  the  beginning  and  at  the  end  of  the 
nested  loops,  we  measured  the  time  we  were  looking  for.  The 
code  used  in  the  experiments  is  presented  in  Figure  15  below. 

float  Timingl,  Timing2,  Difference,  Performance; 
struct  rusage  *x,  *y;  /*  declaration  */ 

/*  other  declarations  and  lines  of  code  will  be  here*/ 

X  =  (struct  rusage  *)  malloc  (sizeof (struct  rusage)); 
y  =  (struct  rusage  *)  malloc  (sizeof (struct  rusage) ) ; 
if  (getrusage  (RUSAGE_SELF,  x)  ==  -1) 
perror  ( "getrusage  ()  ”)  ; 

/*  Nested  loops  are  put  here  */ 
if  (getrusage  (RUSAGE_SELF,  y)  ==  -1) 
perror  ( "getrusage  ()  " ) ; 

Timingl  =  (y->ru_utime.tv_usec  -  x->ru_utime.tv_usec)/1000; 
Timing2  =  y->ru_utime.tv_usec  -  x->ru_utime.tv_usec); 
if  (Timingl  <  0)  { 


26 


Timing2  =  Timing2  -  1; 

Timingl  =  1000  +  Timingl; 

} 

Difference  =  Timing2  +  Timingl  /  1000; 

Performance  =  (2*N*N*N)  /  (1000U00*Difference);  /*  MFLGPS  */ 

Figure  15 

The  code  to  compute  the  performance  of  the  cache 

During  the  experiments,  we  needed  to  edit  files 
quickly  to  change  the  values  for  matrix  and  block  size.  We 
intensely  used  the  Unix  editor,  sed,  to  make  changes  to  the 
files  on  the  command  line.  Also,  usage  of  Tschell  (shell) 
together  with  sed  allowed  the  task  of  making  frequent  changes 
to  the  source  files  by  the  processor  less  tedious. 

2.  Overview  of  The  Targeted  Machine  Architectures 

The  two  machines  used  in  the  experiments  are  Solbourne 
S4000  SPARC  Station  and  DECstation  3100.  The  Solbourne  S4000 
SPARC  Station  has  a  64-bit  SPARC  CPU,  with  a  2kB  2-way  set- 
associative  write-back  physical  data  cache  (DCACHE)  and  6kB  3- 
way  set-associative  instruction  cache  (ICACHE)  .  The  processor 
performs  a  load/store  instruction  in  one  clock,  achieving  25.5 
MIPS.  In  SPARC  microprocessor,  all  performance-critical 
element — integer  CPU,  floating  point  processor,  memory 
management  unit  and  cache — are  integrated  on  a  single  chip. 
The  64  Dit  memory  bus  supports  a  data  transfer  rate  of  60 
Mbytes  per  second  to  accommodate  the  high  performance 
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requirements  of  the  processor.  The  DECstation  3100  has  an  8kB 
double  word  direct -mapped  cache  and  can  hold  all  the  words 
reused  within  an  88x88  block.  For  this  reason,  experiments 
are  also  done  with  block  sizes  over  88x88  in  order  to  measure 
the  real  performance.  Otherwise,  the  data  cache  can  hold 
88x88  (or  smaller)  blocks  and  the  effects  of  self-interference 
misses  may  not  be  observed. 

3.  Blocking  Experiments 

The  main  loops  of  unblocked  and  blocked  matrix 
multiplication  are  illustrated  in  Figure  16  and  17 
respectively. 

for  (i  =  0;  i<N;  ++i)  { 

for  (j=0;  j<N;  ++j)  { 

for  (1  =  0/  i<N;  ++i)  { 

c[i][j]  =  c[i][j]  +  temp*b[k]  [j]  ; 

} 


} 

} 

Figure  16 

Main  Loops  of  Uziblocked  Matrix  Multiplication 


for  (kk=0;  kk<N;  kk=kk+B)  { 

for  (jj=0;  jj<N;  jj=jj+B)  { 
for  (1=0;  i<N;  { 

for  (k=kk;  k<mln (N, kk+B) ;  ++k)  { 
c[l][j}  =  0; 
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temp  =  a[i][k]; 

M  =  min  (N,  jj  +B)  / 
for  (j=jj;  j<M;  ++j)  { 

c[i][j]  =  c[i][j]  +  temp*b[k]  [j] : 


Figure  17 

Main  Loops  of  Blocked  Matrix  Multiplication 

First,  the  code  in  Figure  17  is  blocked  at  both  data  cache 
and  register  levels.  a[i][k]  is  register  allocated  ana  Lh? s 
relatively  increases  the  performance  of  blocked  code. 
Secondly,  min  function  is  handled  with  a  macro.  In  the  very 
first  experiments,  the  min  function  was  placed  in  the  for 
statement  (as  in  Figure  18) . 


temp  =  a[i][k]; 

for  (j=jj;  j<min(N,  jj+B) ;  ++->''  { 

c[i][j]  =  c[i][j]  +  temp*b[k]  [j]  ; 


Figure  18 

The  for  statement  with  min  function  in  it 

Removing  it  as  in  Figure  19  improved  the  performance 
drastically. 
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temp  =  a  [ij  [k]  ; 

M  =  min  (N,  jj+B)  ; 
for  (j=jj;  j<M;  ++j)  { 

c[i][j]  =  c[i][j]  +  temp*b[k]  [j]  ; 


Figure  19 

The  for  statement  without  min  function  in  it 

a.  Cache  Performance  Experiments  with  Slocked  Codes 
on  Solboume  S4000  SPARC  station 

In  Figure  20,  the  performance  of  blocked  matrix 

multiplication  is  plotted  on  Solbourne  S4000  SPARC  station. 

The  code  is  compiled  without  enabling  the  code  optimization  of 

gcc  (the  gnu  compiler)  .  The  graph  in  Figure  20  plots  the 

performance  levels  obtained  for  three  slightly  different 

matrix  sizes  across  a  range  of  blocking  factors.  Blocking 

effects  are  not  observed  when  code  optimization  of  the 

compiler  is  not  enabled.  There  is  a  negligible  change (<6%) 

between  performances  of  blocked  and  non-blocked  code  when  no 

optimization  switch  is  used.  The  reason  is  that  the  number  of 

instructions  in  the  non-optimized  code  was  much  bigger  than 

that  of  the  optimized  one  and  the  percentage  of  memory  access 

instructions  to  total  memory  instructions  was  very  small.  In 

other  words,  the  code  was  running  inefficiently.  In  contrast, 

optimized  code  had  a  smaller  number  of  memory  instructions 
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Figure  20 


The  Performance  of  Blocked  Matrix  Multiplication  on 
Solboume  S4000  SPARC.  Compiler  optimization  para¬ 
meters  are  not  enabled  during  compilation  process. 
The  ftbbreviation  N.BL.  means  No  BLocking 

and  most  of  them  were  to  access  memory,  implying  its 
efficiency . 

Figure  21  is  similar  to  Figure  20  except  that  gcc 
compiler  optimization  was  invoked.  The  code  was  compiled  by 
the  command  "  cc  -02  source_code .  c  "  at  the  prompt .  The 
performance  on  ::  Ibourne  S4000  SPARC  station  gets  better  when 
the  block  size  is  16.  Because  it  hw..  2kB  2- way  set- 

associative  data  cache  and  a  word  length  of  4  bytes.  xiixs 
corresponds  to  a  data  cache  size  of  512  words.  For  a  local 
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memory  of  size  C,  the  optimal  blocking  factor  is  roughly 
sgrt(C) [9] .  Since  the  data  cache  in  SPARC  station  is  two-way 
set  associative,  sgrt (512/2)  is  equal  to  16  and  therefore  a 
block  size  of  16  x  16  results  in  a  better  performance.  As 
observed  in  Figure  21,  optimized  blocked  code  with  a  blocking 
factor  of  16  is  40%  better  in  performance  than  the  one  with  no 
blocking. 


Figure  21 

The  Performance  of  Blocked  Matrix  Multiplication 
on  Solboume  S4000  SPARC  station.  Compiler 
optim^ization  switch  -02  is  enabled 
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Jb.  Cache  Performance  Experiment  a  with  Blocked  Codes 
on  DECatation  3100 

Figure  22  shows  the  graph  for  the  compiler 
optimized  blocked  code.  Blocking  optimizes  the  code  and  -A/e 
also  observe  a  better  performance  of  blocking  if  the  code  is 
compiler  optimized  for  DECstation  3100.  As  shown  in  Figur' 
22,  matrix  sizes  295  and  300  present  almost  the  same  perfor¬ 
mance  of  the  data  cache  and  consistent  increase  until  we  reach 
block  size  of  72.  At  block  size  of  80,  both  of  them  show 
their  highest  performances.  After  this  block  size,  no  big 
change  is  observed  in  the  plottings  of  these  two  matrix  sizes. 
Matrix  size  293  present  a  steadily  increasing  performance  like 
the  previous  two.  But,  after  achieving  its  best  performance 
at  block  size  of  40,  it  behaves  differently  and  the  perfor¬ 
mance  for  the  data  cache  degrades  drastically  beginning  from 
the  block  size  of  56.  The  lowest  performance  for  three  matrix 
sizes  is  obtained  when  no  blocking  is  applied.  Because,  the 
interference  misses  of  the  cache  increases  in  this  case.  This 
means  that  the  data  which  we  can  reuse  are  replaced  by  self- 
interfering  array  elements. 

c.  Comparison  of  The  Two  Machines  with  respect  to 
Speedup  measure 

The  two  machines,  DECstation  3100  and  Solbourne 
S4000  SPARC  station  are  coiitijared  with  to  speedup  for 

different  blocking  factors  in  Figure  23.  Speedup  is  calcu¬ 
lated  by  dividing  each  performance  value  by  performance  of 
unblocked  code  for  the  respective  machine.  Solbourne  S4000 
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Figure  22 

The  Performance  of  CoxDpiler  Optimized  Blocked  Matrix 
Multiplication  on  DECstation  3100.  Compiler  optimi¬ 
zation  switch  -02  is  used  for  compilation  process 

SPARC  station  shows  the  highest  increase  in  performance  at 
block  size  of  16  for  three  matrix  sizes.  This  is  related  to 
the  data  cache  size.  But  this  increase  does  not  last  long  and 
a  decrease  is  observed  after  block  size  16  which  is  the 
optimal  blocking  factor  for  SPARC  station.  The  speedup 
decreases  consistently  after  blocking  factor  of  32.  As  the 
block  size  increases  over  this  value,  the  performance  of  the 
data  cache  decreases  due  to  self-interference . 

For  matrix  sizes  295  and  300,  DECstation  3100 
shows  consistent  performance  increase  as  the  blocking  factor 
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BLOCKING  FACTOR 


SPA-293  SPA-295  SPA-300 

DEC-293  -N-  DEC-295  DEC-300 


Figure  23 

Comparison  o£  DECstation  3100  and  Solboume  S4000 
SPARC  station  with  respect  to  speedup  vs  blocking 
factors  for  matrix  sizes  of  293,  295,  30C 

increases.  But,  matrix  size  of  293  does  not  behave  like  the 
previous  two.  The  performance  of  data  cache  gets  worse  after 
the  blocking  factor  of  56.  This  shows  that  very  similar 
matrix  sizes  may  behave  differently.  Data  cache  in  DECstation 
3100  displays  better  performance  for  a  wide  range  of  blocking 
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factors  where  the  data  cache  in  Solbourne  S4000  SPARC  station 

performs  its  best  just  for  blocking  factor  of  16, 

d.  Cache  Performance  Experiments  with  Blocked 
Unfolded  Codes  on  Solbourne  S4000  SPARC  station 

In  an  attempt  to  furthermore  increase  in  the 

performance  of  the  data  cache,  experiments  with  loop  unfolding 

(unrolling)  technique  were  performed.  Loop  unfolding  is  the 

process  of  replacing  the  iterations  of  a  loop  with  noniterated 

straight-line  code  [Ref.  16].  Shown  in  Example  4  are  the 

codes  for  matrix  multiplication  without  and  with  any 

unfolding.  The  code  is  unfolded  3  times,  where  U  is  the 

variable  to  show  how  many  times  the  loop  is  unfolded. 

Example  4 . 

for  (k=k}c;  k<min  (N,  kk+B)  ;  ++k)  { 
temp  =  a[i]  [k] ; 

M  =  min (N,  jj +B) : 

for  (j=jj;  j<M;  j=jj+B)  {  /*  no  unfolding  */ 

c[ij[j]  +=  temp  *  b[k][j]; 

} 

is  unfolded  to 

for  (k=kk;  k<min (N,  kk+B) ;  ++k)  { 
temp  =  a[i]  [k]  ; 

M  =  min(N,  jj+B)  ; 

if  (M  %  U==0)  {  /*  control  statement  for  unfolding  */ 

for  j<M;  j=j+U)  {  /*  U=3,  unfolded  3  times  */ 
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C[i][j]  += 

temp 

*  b[k]  [j]; 

C[i][j+1]  += 

temp 

*  b[k][j+l]; 

c[i][j+2]  += 

temp 

*  b[k]  [j+2]  ; 

} 

} 

else  { 

for  (j=jj:  j<M;  ++j)  { 

c[i]  [j]  +=  temp  *  b[k][j]; 

} 

} 

The  data  obtained  from  the  experiments  with  unfolded  blocked 
matrix  multiplication  code  shows  that  the  overhead  on  the 
control  statement  plays  an  important  role  in  reducing  the 
p0if f ormance .  So*  blocked  codes  with  and  without  unfolding 
achieve  negligibly  different  performances.  Figure  24 
illustrates  this  cache  behavior  for  the  unfolded  8  times. 
Unfolding  did  not  improve  the  performance  on  the  blocked  code 
because  of  the  control  statement  overhead.  The  degree  of 
unfolding  the  loop  iterations  affects  the  performance 
slightly . 
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Figure  24 

Performance  for  a  blocked  and  8  times  unfolded 
matrix  multiplication  code,  Solboume  S4000 
SPARC  station 

The  graph  in  Figure  25  plots  the  performance  levels  for 
blocked  matrix  multiplication  code  with  16  and  8  times 
unfolded.  As  presented  in  Figure  25,  the  performances  of  16 
and  8  times  unfolded  codes  are  not  extremely  different  from 
each  other.  This  again  shows  that  the  control  statement  (if 
statement)  in  the  code  before  the  unfolded  lines  of  loop 
iterations  is  the  main  reason  of  the  overhead,  rather  than  the 
iterations  which  are  not  unfolded. 
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Figure  25 

Performance  for  a  blocked  matrix  multiplication  code 
unfolded  16  and  8  times,  Solboume  S4000  SPARC  station 
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III.  ALGORITHMS  TO  IMPROVE  DATA  CACHE  PERFORMANCE 


A.  SELF- INTERFERENCE  IN  CACHE 

A  reused  variable  of  an  array  will  miss  in  the  cache  if 
the  references  between  reuse  occupy  the  same  cache  location, 
thus  leading  to  self-interf erence  misses.  If  the  accesses  are 
done  in  a  stride  one  manner,  then  no  interference  will  occur 
unless  the  number  of  accessed  data  is  bigger  than  the  cache 
size.  Otherwise,  since  the  access  pattern  of  a  stride  one  or 
a  constant  stride  array  is  uniform,  its  self-interf erence 
pattern  also  becomes  uniform  and  increases  the  cache  misses. 
In  order  to  avoid  self  interference  totally,  "the  largest 
block  size  that  does  not  suffer  from  any  self -interference 
(Be)”  is  computed. 

B.  THE  CRITICAL  BLOCKING  FACTOR 

Lam,  Rothberg,  and  Wolf  have  developed  an  algorithm  to 
find  the  largest  square  block  that  avoids  self-interf erence 
for  a  given  matrix  size.  This  algorithm  tailors  the  blocking 
factor  according  to  the  problem  size,  i.e.,  matrix  size,  to 
improve  the  average  miss  rate  [Ref.  9]  .  By  doing  so,  they 
intend  to  improve  the  average  miss  rate  and  to  reduce  the 
variance.  Since  the  periodicity  in  the  addressing  of  a 
direct-mapped  cache  and  the  constant-stride  accesses  are 
obvious,  it  is  relatively  easy  to  determine  B^.  Their 
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determine  the  largest  square  block  with  no  self-interference 
is  shown  in  Figure  26. 

algorithm  FindB(N,  C  :  integer)  return  integer; 
maxWidth,  addr,  di,  dj,  X,  Y,  N  :  integer; 
maxWidth  =  min(N,  C); 
addr  =  N/2; 
while  true  do 

addr  =  addr  +  C; 
di  =  addr  div  N; 
dj  =  abs((addr  mod  N)  -  N/2); 
if  (di  >=  min  (maxWidth,  dj)) 
return  min  (maxWidth,  di)  ; 
else 

maxWidth  =  di  >=  min  (maxWidth,  dj; 
end  while; 
end  algorithm; 


Figure  26 

Lam's  Algorithm  to  compute  the  Critical 
Blocking  Factor  Be 

1.  Sensitivity  of  B^  with  respect  to  Matrix  Size 

Be  obtained  from  Lam's  algorithm  for  some  matrix  sizes 
draws  attention  to  its  sensitivity  with  respect  to  matrix 
size.  Figure  27  shows  the  critical  Oxocking  factor  for  a  some 
matrix  sizes  between  293  and  323. 
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MATRIX  SIZE 


Figure  27 

Be  versus  Matrix  Size.  Solboume  S4000  SPARC  Station 

The  critical  blocking  factor  cannot  be  larger  than  sqrt(C), 
so  the  run  time  of  the  algorithm  is  0(N/sqrt(C))  [Ref.  9], 
Figure  28  plots  the  performance  of  the  data  cache  for  the 
values  shown  in  Figure  27.  The  performance  is  greatly 
affected  by  value  of  Be,  dropping  to  1.7  Mflops  for  a  321x321 
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Figure  28 

Data  Cache  Performemce  versus  Hatrix  Size.  Be  is 
obtained  by  using  Lam's  Algorithm  for  Matrix  Size. 
Solboume  S4000  SPARC  Station. 


matrix  and  jvimping  to  2.0  Mflops  (an  increase  of  17%)  for  a 
323x323  matrix, 

2.  Effect  of  Declared  Matrix  Size  on  Be 

It  is  important  to  obtain  consistent  performance  from 
an  algorithm.  Small  changes  In  th“  values  of  the  inputs 
generally  should  not  cause  huge  swings  in  the  performai*^  of 
the  algorithm.  Users  tend  to  use  arbitrary  array  sizes  larger 
than  maximum  expected  problem  size. 
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If  we  use  Lam's  Algorithm  which  changes  the  block  size 
by  searching  for  a  better  Be,  then,  for  a  given  matrix  size, 
we  can  keep  increasing  the  size  of  the  matrix  and  determine 
another  Be  until  we  find  a  block  size  which  gives  a  better 
performance,  preferably  sqrtlC)  where  C  is  an  integer. 

In  Figure  29,  the  performance  variation  is  shown  for 
actual  matrix  sizes  of  293,  295,  and  300,  where  Be  is  computed 
while  extending  the  size  of  the  matrix.  Here,  we  demonstrate 
the  sensitivity  of  Lam's  Algorithm  to  the  changes  on  the 
actual  matrix  size.  Even  a  small  change  on  one  dimension  of 
array  b[k]  [j]  affects  the  performance  of  the  code  blocked  with 
a  blocking  factor  for  a  matrix  size  as  big  as  that  new 
dimension.  For  example,  when  we  use  actual  matrix  size  of  293 
without  any  change  on  any  dimension,  data  cache  performance  is 
2.7  Mflops.  If  we  declare  one  dimension  of  the  same  matrix  to 
be  310,  then  the  performance  jumps  up  to  3.05  Mflops  (an 
increase  of  13%) .  For  dimension  of  321,  the  performance 
decreases  from  3.05  Mflops  to  2.2  (a  decrease  of  38%). 

Since  calculation  of  Bp  is  sensitive  to  this  small 
change,  we  can  draw  insights  to  this  fact  as  follows: 

1.  Block  size  is  highly  sensitive  to  matrix  size. 

2.  Some  bigger  block  sizes  after  this  change  shows 
that  cache  is  not  filled  as  much  as  possible  with 
the  block  size  computed  with  the  algorithm.  In 
other  words,  the  bigger  portion  of  cache  used, 
the  better  cache  performance  for  some  array 
sizes.  As  long  as  the  in  the  cache,  in  a  certain 
range,  if  we  increase  the  block  size,  it  helps 
decreasing  misses.  The  following  equation  in 
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EXTENDED  MATRIX  SIZE 


-m-  N=293  --s-  N=295  N=300 


Figure  29 

Extended  Dimension  versus  Performance  for  Actual 
Matrix  Sizes  of  293,  295,  and  300.  Solboume  S4000 
Sparc  Station 


Figure  30  models  the  miss  rate  in  terms  of  parameters  used  in 
the  blocked  matrix  multiplication  and  calculates  the  total 
number  of  cache  misses  [Ref.  9]. 

-  lf[2/B  +  Si(b)  +  3(1  ~  Si(b))B/C  *  B/C] 

where  =  the  number  of  operations  performed, 
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B 


=  blocking  factor. 


C  =  cache  size, 

Si(b)=  (1  -  =  self -interference  of 

accessing  B-1  rows  of  array  b 
before  the  same  data  is  reused. 

Figure  30 

Equation  to  cooipute  total  number  of  misses  in  cache 

According  to  this  equation,  there  are  N^(2/B)  compulsory 
misses,  misses  that  are  intrinsic  to  the  algorithm  given  the 
blocking  factor  and  cannot  be  avoided  even  if  the  address 
mapping  is  perfect.  The  factor  Si(b)  is  due  to  self  inter¬ 
ference  among  the  elements  in  the  b  array.  Any  blocking 
factor  lower  than  the  critical  blocking  factor  does  not  cause 
self  interference.  The  other  two  terms  are  due  to  cross 
interference  between  different  variables.  Example  5  below  is 
presented  to  quantitavely  demonstrate  the  effect  of  B^. 

Example  5. 

If  actual  matrix  size  =  293  and  C  =  256,  then  B  =  7  is 
computed  with  Lam's  Algorithm  presented  in  Figure  26.  With 
these  values,  the  total  number  of  misses  =  1.17N\  If  we 
fix  the  cache  size  and  extend  the  declared  matrix  size  to 
3  04  as  determined  with  our  seacrh  technique,  then  B^  =  16  is 
computed  with  these  values,  the  total  number  of  misses  = 
0.68N^.  We  showed  that  the  total  number  of  misses  for  a 
block  size  obtained  with  Lam's  Algorithm  is  worse  than  that 
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of  obtained  with  the  extended  matrix  size  (1.17/0.68  =  1.72, 
72%  degrades  the  performance) .  Therefore,  having  a  bigger 
blocking  factor  and  an  extended  matrix  dimension  leads  to 
a  better  daca  cache  performance,  while  decreasing  the  miss 
rate . 

C .  CHANGING  ARRAY  SIZE  VIA  A  SEARCH  TECHNIQUE  AND  DETERMINING 
THE  CRITICAL  BLOCKING  FACTOR 

We  present  a  search  technique  to  find  values  of  Bp  larger 
than  those  computed  using  the  algorithm  in  [Ref.  9].  This 
technique  computes  Be  by  searching  through  ail  blocking 
factors  for  the  range  of  matrix  sizes  up  to  10%  bigger  than 
the  actual  matrix  size.  It  mutually  employs  the  algorithm  in 
Figure  26  until  it  returns  the  biggest  blocking  factor,  less 
than  or  equal  to  sqrt(C),  where  C  is  cache  size,  and  bigger 
than  1,  which  is  the  smallest  blocking  factor  that  we  can 
assign.  We  demonstrated  that  this  blocking  factor  improves 
data  cache  performance  as  shown  in  Example  5.  The  following 
C  language  code  in  Figure  31  illustrates  this  technique. 

/*  algorithm  COMPUTE_B  */ 

^define  min(a,b)  ( (a<b)  ?  a:b)  /*  min  function  */ 
int  algodnt,  double); 
main  ( ) 

{ 

int  k,  N,  Nmax,  B,  Bmax=l,  i,  Nlimit; 
double  C; 

printf(”  Enter  the  size  of  the  matrix:  \n”) ; 
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scanf  ( ''%d”,  &N)  ; 

printf("  Enter  the  size  of  the  cache: \n" ) ; 
scanf  ( "%d" ,  &C)  ; 

Nmax  =  N; 

k  =  N  *  0.10;  /*  max  allowable  matrix  size  increase  */ 

Nlimit  =  N  -f-  k;  /*  max  matrix  size  */ 
for  a  =  N;  i  <=  Nlimit;  ++i)  { 

B  =  algo(N,  C) ; 

if  ( (B  <=  sqrt(C))  &&  (Bmax  <  B) )  { 

Bmax  =  B; 

Nmax  =  N; 

} 

} 

print  f  ( "New  Matrix  Size=  %d\n”,  Nmax); 
print f ( "Blocking  Factor=  %d\n",  Bmax); 

int  algo  (Nr  C)  /*  Lam's  Algorithm  (algorithm  FindB) 

in  Figure  1  */ 


Figure  31 

A  search  technique  coupled  with  Lam's  Algorithm 

to  find  Be 

The  technique  keeps  extending  the  actual  matrix  size  within 
the  limit  of  10%  of  the  actual  matrix  size.  For  every  size, 
it  computes  by  repetitively  applying  Lam's  Algorithm  in 
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Figure  26,  and  retrieves  the  first  biggest  blocking  factor  and 
the  corresponding  extended  matrix  size. 

Here  the  choice  of  10%  is  arbitrary.  In  practice,  we  do 
not  need  to  extend  both  rows  and  columns  of  the  actual  matrix. 
Because,  the  advantage  of  this  approach  is  that  it  finds  an 
extended  matrix  size  which  results  in  a  better  data  cache 
performance  than  that  of  Lam's  Algorithm.  Miss  rates  obtained 
from  the  equation  in  Figure  30  by  substituting  the  values  for 
two  different  matrix  sizes  show  that  matrix  size  and  blocking 
factor  from  our  algorithm  present  a  better  total  number  of 
misses.  Moreover,  in  Example  5,  cache  portion  used  for  the 
blocking  factor  computed  with  Lam's  Algorithm  is 

(7  x  7)  /  256  =  0.19  -->  19%  of  the  cache, 

while  the  blocking  factor  computed  with  our  technique  allows 

(16  X  16)  /  256  =1.00  -->  100%  of  the  cache  to  be  utilized. 

Therefore,  the  technique  presented  above  als:  '' es  us  to 

utilize  data  cache  optimally.  However,  this  approach  suffers 
from  using  additional  columns/rows  of  a  matrix. 

D.  A  NEW  ALGORITHM  TO  FIND  THE  CRITICAL  BLOCKING  FACTOR  WITH 
NO  SELF- INTERFERENCE 

The  drawback  of  the  search  technique  in  the  prf:  jus 
section  is  that  it  uses  Lam's  algorithm  which  has  a  complexity 
of  0(N/sqrt(C))  [Ref.  9].  For  each  extended  matrix  size, 
using  this  algorithm  becomes  an  overhead  for  the  search 
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technique.  Additionally,  we  have  no  guarantee  for  an 
acceptable  choice  of  Be. 

In  this  section,  we  introduce  a  new  algorithm  which 
improves  data  cache  performance  by  determining  the  critical 
blocking  factor  with  no  self-interf erence .  The  algorithm  is 
shown  in  Figure  32. 

/*  DIRECT  Algorithm  */ 
int  geddnt,  int) ; 
main  () 

{ 

int  SIZE,  C,  C',  OLD_SIZE,  B=l,  X,  GCD; 

OLD_SIZE  =  SIZE; 

if  (sqrt(C)  is  not  an  integer  ) 

C'  =  C/2; 

else 

C'  =  C; 

X  =  sqrt  (C) - (SIZE  %  sqrt  (C' ) ) ; /*  these  two  lines  set 

matrix  */ 

SIZE  =  SIZE  +  X;  /*  to  be  a  multiple  of 

sqrt  (C)  */ 

do  { 

GCD  =  gcd( (SIZE/sqrt  (C' ) ) ,  sqrt  (C' ) ) ;  /*  ged  test  */ 

if  (GCD  =  1) 

B  =  sqrt(C'); 
else 

SIZE  =  SIZE  +  sqrt (C') ; 
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}  while  (B  /=  sgrt(C')); 
gcddnt  VI,  int  V2)  { 
int  temp; 
while  (V2)  { 
temp  =  V2; 

V2  =  VI  %  V2; 

VI  =  temp; 

} 

return  VI; 

} 

} 

Figure  32 

A  New  Algorithm  to  find 

Direct  Algorithm  allows  us  to  use  sqrt(C)  as  optimal  blocking 
factor  if  sqrt(C)  is  an  integer  all  the  time.  If  not,  the 
optimal  blocking  factor  is  considered  as  sqrt(C/2),  because  we 
obtain  a  multiple  of  two  for  C/2 .  The  underlying  principle  is 
to  cover  data  cache  as  much  as  possible  at  the  expense  of 
potentially  using  more  memory. 

We  assume  that  the  matrix  has  sqrt(C)  elements,  where  C  is 
the  cache  size.  If  each  element  in  the  matrix,  which  we  call 
it  super_  element  E,  contains  sqrt(C)  elements  with  a  stride 
S  between  them  and  if  cache  is  divided  by  the  size  of  sqrt(C) 
super_elements,  then  (C  /  E)  corresponds  to  the  number  of 
blocks  in  the  cache.  So,  if  S  and  (C  /  E)  are  relatively 
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prime,  which  also  means  that  their  greatest  common  divisor 
(gcd)  is  1,  then  every  super_element  will  be  placed  in  a  new 
super_location  up  to  (C  /  E)  different  super_locations . 

The  advantages  of  Direct  Algorithm: 

-  The  time  consumed  to  compute  the  extended  matrix  size  and 
the  corresponding  Be  is  shorter  than  the  one  in  Lam's 
Algorithm.  Therefore,  we  save  time. 

-  The  blocking  factor  is  always  sgrt(C)  or  sqrt{C/2)  due  to 
the  value  of  cache  size,  at  which  we  get  a  better 
performance  than  that  of  Lam's  Algorithm. 

The  disadvantages  of  Direct  Algorithm: 

-  As  in  the  previous  section,  this  algorithm  also  suffers 
from  using  additional  rows/columns  which  is  an  overhead 
for  the  computation.  The  maximum  addition  in  size  is  2  * 
SQRT(C)  -1  (if  SQRT(C)  is  an  integer) . 

This  algorithm  is  experimented  on  Solbourne  S4000  SPARC 
station  and  DECstation  3100.  Figure  33  plots  the  performance 
of  the  data  cache  for  a  wide  range  of  matrix  sizes  on  SPARC 
station.  Figure  34  plots  the  actual  matrix  size  versus  the 
extended  matrix  size  obtained  with  Direct  Algorithm. 

This  algorithm  finds  a  matrix  size  suitable  to  be  used 
with  a  block  size  of  sqrt(C)  if  it  is  an  integer  (or  sqrt(C/2) 
if  not)  .  Since  cache  size  is  512  words  for  a  two-way  associa¬ 
tive  data  cache  in  SPARC  station,  sqrt (512/2)  =  sqrt(256)  =16 
is  an  integer.  So,  we  use  almost  100%  of  the  cache.  The 
above  performance  curve  in  the  graph  shows  the  performance  for 
a  matrix  size  determined  by  Direct  Algorithm  with  a  block  size 
of  sqrt(C).  It  presents  a  better  performance  level  compared 
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DIRECT  ALGORITHM  ^  LAM  S  ALGORITHM 


Figure  33 

Performance  levels  with  optimal  block  size  of  16 
(without  any  chemge  on  matrix  size)  and  with  a  block 
size  of  sqrt(C)  (for  extended  matrix  size  determined 
by  the  algorithm  above)  on  the  data  cache  of  Solboume 
S4000  SPARC  station. 
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Figure  34 

The  Actual  Matrix  Size  versus  the  Extended  Matrix 
Size  obtained  with  Direct  Algorithm.  Solboume 
S4000  SPARC  station 
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to  the  one  below  and  behaves  very  consistent.  The  variation 
in  performance  for  the  one  below  is  due  to  the  interference 
misses  in  the  data  cache. 

In  the  experiment  on  DECstation  3100,  three  matrix 
sizes  (293,  295.  300)  are  used  and  performance  level  of  the 
data  cache  is  observed  for  the  values  obtained  with  the  new 
algorithm  and  the  best  blocking  factors  obtained  previously. 

Figure  35  shows  the  two  performance  levels  of  the  data 
cache  on  DECstation  3100. 

Figure  36  plots  the  actual  matrix  size  versus  the  extended 
matrix  size  obtained  with  Direct  Algorithm.  DECstation  3100 
has  8K  double  word  direct  mapped  cache  where  sqrt(C)  is  not  an 
integer.  So,  Direct  Algorithm  gets  half  of  the  cache  and 
determines  the  new  matrix  size  shown  in  Figure  36  for  sqrt 
(C/2)  . 

Our  algorithm  performs  its  best  if  the  number  of  distinct 
array  references  to  data  cache  is  less  than  or  equal  to  the 
associativity  of  data  cache.  Otherwise,  interference  among 
the  references  may  occur. 


Figure  35 


Performance  levels  witb  optimal  block  sizes  (with  no  change 
on  matrix  size)  and  with  a  block  size  of  sqrt(C/2)  (with 
matrix  size  determined  by  the  algorithm  above)  for  the  data 
cache  on  DBCstation  3100 
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Figure  36 

The  Actual  Matrix  Size  versus  the 
Extended  Matrix  Size  obtained  with 
Direct  Algorithm.  DECstatlon  3100 
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IV.  CONCLUSIONS  AND  RECOMMENDATIONS  FOR 
FURTHER  RESEARCH 


In  Chapter  I,  we  reviewed  the  notion  of  reuse  and  reported 
the  importance  of  the  number  of  cache  misses  for  RISC  proces¬ 
sors.  Since  there  is  an  amazingly  large  reuse  of  data  in 
cache,  we  focused  on  some  techniques,  namely  blocking 
techniques,  which  are  used  to  reduce  memory  traffic  and 
exploit  this  data  reuse. 

In  Chapter  II,  we  performed  several  experiments  on 
blocking  and  presented  the  results  similar  to  those  in  [Ref. 
9]  for  a  better  understanding  of  the  effect  of  blocking  on  the 
performance  of  two  different  data  caches. 

In  Chapter  III,  we  demonstrated  the  sensitivity  of  the 
algorithm  in  [Ref.  9]  and  the  sensitivity  of  blocking  factor 
obtained  by  that  algorithm  to  the  declared  array  size.  In 
fact,  this  sensitivity  shows  that  the  performance  of  the 
algorithm  is  very  unstable  related  to  the  size  of  the  matrix. 
To  remedy  this  problem,  we  introduced  two  algorithms:  The 
first  one  is  a  search  technique  which  finds  values  of  Be 
larger  than  those  computed  using  the  algorithm  in  [Ref.  9]. 
This  technique  computes  Be  by  searching  through  all  blocking 
factors  for  the  range  of  matrix  sizes  up  to  10%  bigger  than 
the  actual  matrix  size.  It  iteratively  employs  the  algorithm 
in  [9]  until  it  returns  the  biggest  blocking  factor,  less  than 
or  equal  to  sqrt(C),  where  C  is  cache  size,  and  bigger  than  1, 
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which  is  the  smallest  blocking  factor  that  we  can  assign.  The 
other  is  Direct  Algorithm  which  improves  the  performance  of 
workstations  in  some  specific  applications.  This  algorithm 
finds  a  matrix  size  suitable  to  be  used  with  a  block  size  of 
sqrt(C)  if  it  is  an  integer  (or  sqrt(C/2)  if  not).  It  gives 
the  best  results  when  the  number  of  distinct  array  references 
is  less  than  or  equal  to  the  associativity  of  data  cache. 

For  further  research  we  recommend  looking  at  the 
algorithms  in  situations  where  the  number  of  distinc:  forms  of 
array  references  is  greater  than  the  cache  associativity 
factor . 
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